Constructing an accurate system model for formal model verification can be both resource-demanding 
and time-consuming. To alleviate this shortcoming, algorithms have been proposed for automatically 
learning system models based on observed system behaviors. In this paper we extend the algorithm 
on learning probabilistic automata to reactive systems, where the observed system behavior is in 
the form of alternating sequences of inputs and outputs. We propose an algorithm for automatically 
learning a deterministic labeled Markov decision process model from the observed behavior of a 
reactive system. The proposed learning algorithm is adapted from algorithms for learning deterministic 
probabilistic finite automata, and extended to include both probabilistic and nondeterministic transitions. 
The algorithm is empirically analyzed and evaluated by learning system models of slot machines. The 
evaluation is performed by analyzing the probabilistic linear temporal logic properties of the system 
as well as by analyzing the schedulers, in particular the optimal schedulers, induced by the learned 
models. 



1 Introduction 



Model checking is successfully used in many areas to check a formal system model against a specification 
given by a logical expression. However, constructing an accurate model of an industrial system is usually 
difficult and time-consuming. The difficulty of model construction, or system modeling, is also regarded by 
industry as an obstacle to adopting other powerful model-driven development (MDD) techniques and tools. 
Meanwhile, the necessary accurate, up-to-date and detailed documentation rarely exists for legacy 
software or third-party components. Therefore, we consider system model learning techniques [12, 14, 16], 
which can automatically construct or learn an accurate high-level model from observations of a given 
black-box embedded system component. Afterwards, given a learned and explicitly represented model, 
model checking and other MDD techniques can be applied together with other existing component models. 

For learning non-probabilistic system models, Angluin's approach [2] has been well developed 
and implemented in [12, 14]. However, a disadvantage of those system models is that complex systems 
are often only partially observable via their interactions with the user. Even worse, the observations are 
often not noise-free. Compared with deterministic models, probabilistic models are better suited to 
modeling a complicated real system with its physical components, unpredictable user interactions and the 
use of randomized algorithms. In this paper, we focus on probabilistic models. Sen et al. [16] adapted 
the algorithm from [5] for learning Markov chain models for the purpose of verification. In [13], a learning 
approach related to [16] is developed, and strong theoretical and experimental consistency results are 
established. Considering the restricted situation in which the target system is not fully under control and only a 
single observation sequence is available, the algorithm for learning variable order Markov chains [13] is 
developed to verify stationary system properties on the learned models [6]. 
In Markov chains, probabilistic choices may serve to model and quantify possible outcomes of 
randomized actions or the interface between a system and its environment. This, however, requires 
extensive statistical experiments to obtain adequate distributions that model the average behavior of the 
environment. For a system that is open to interaction with its environment, it is natural to model the 
environment by nondeterminism; system properties then need to be guaranteed for all potential environments [?]. 
Therefore, Markov decision processes (MDPs), which exhibit both nondeterministic and probabilistic 
behavior, are widely used for modeling reactive systems [?]. In this paper, we adapt the algorithm 
for learning deterministic probabilistic finite automata to include nondeterministic actions. In particular, 
we learn deterministic labeled Markov decision processes (DLMDPs), where input actions are chosen 
nondeterministically and outputs given inputs are determined probabilistically, from the observed input 
and output behavior of a reactive system. This leads to another motivation for learning: 
for large systems, we may be interested in only one component, which receives certain inputs from the 
environment or from other components. The learner can then output a model that represents 
this component. 

Besides model learning, statistical model checking (SMC) [11, 20] techniques can also be used to 
analyze black-box systems. Statistical model checking uses hypothesis testing based on sampled runs 
of a system, which allows the user to check to a desired level of confidence whether a given logical property 
holds with a given (minimum) probability. Unfortunately, this technique is not well suited to MDPs, since 
in the presence of nondeterminism the sampling of paths is not well defined [?] without an extra 
scheduler. Moreover, the model output by the model learning approach can be used to check other properties 
without re-sampling, as well as for other MDD tasks. 

The main contribution of this paper is the development of the IOALERGIA algorithm for learning DLMDPs, 
which is obtained as an adaptation of the previous ALERGIA [5] algorithm. In order to demonstrate its 
applicability, the new algorithm is applied to learning models of slot machines from observed system 
behavior, which is in the form of alternating sequences of inputs and outputs. The evaluation is performed 
by analyzing and comparing probabilistic linear time properties in the learned model and the known 
generating model, as well as the maximal expected reward and the optimal schedulers. 

This paper is structured as follows: Section 2 contains background material. Section 3 describes 
the procedure for generating learning data, while Section 4 describes the IOALERGIA algorithm. Section 5 
demonstrates its applicability through a case study concerning slot machines. Section 6 concludes the 
paper. 



2 Preliminaries 

2.1 Labeled Markov Decision Processes 

Definition 1 (LMDP) A labeled Markov decision process (LMDP) is a tuple $M = (Q, \Sigma_I, \Sigma_O, \pi, \tau, L)$, where 

• $Q$ is a finite set of states, 

• $\Sigma_I$ is a finite input alphabet, and $\Sigma_O$ is a finite output alphabet, 

• $\pi : Q \to [0, 1]$ is an initial probability distribution such that $\sum_{q \in Q} \pi(q) = 1$, 

• $\tau : Q \times \Sigma_I \times Q \to [0, 1]$ is the transition probability function such that for all $q \in Q$ and all $\alpha \in \Sigma_I$, $\sum_{q' \in Q} \tau(q, \alpha, q') = 1$ or $\sum_{q' \in Q} \tau(q, \alpha, q') = 0$, 

• $L : Q \to \Sigma_O$ is a labeling function. 



H. Mao, Y. Chen, M. Jaeger, T. D. Nielsen, K. G. Larsen, & B. Nielsen 






An input $\alpha \in \Sigma_I$ is enabled in state $q \in Q$ if and only if $\sum_{q' \in Q} \tau(q, \alpha, q') = 1$. Let $Act(q)$ denote the set 
of enabled actions in $q$. 

Definition 2 (DLMDP) An LMDP is deterministic if 

• there exists a state $q^s \in Q$ with $\pi(q^s) = 1$, 

• for all $q \in Q$, $\alpha \in \Sigma_I$ and $\sigma \in \Sigma_O$, there exists at most one $q' \in Q$ with $L(q') = \sigma$ and $\tau(q, \alpha, q') > 0$. 
We then also write $\tau(q, \alpha, \sigma)$ instead of $\tau(q, \alpha, q')$. 

2.2 Strings 

Let $\Sigma_O(\Sigma_I\Sigma_O)^*$ and $\Sigma_O(\Sigma_I\Sigma_O)^\omega$ denote the set of all finite, respectively infinite, strings of alternating input 
and output symbols. For a finite string $s = \sigma_0\alpha_1\sigma_1 \ldots \alpha_n\sigma_n$, with $\alpha_i \in \Sigma_I$ and $\sigma_i \in \Sigma_O$, the set of all its prefixes 
is defined as: 

$$prefix(s) = \{\sigma_0\alpha_1\sigma_1 \ldots \alpha_k\sigma_k \mid 0 \le k \le n,\ k \in \mathbb{N}\}$$

For a set of strings $S$, $prefix(S)$ denotes the set of all prefixes of strings $s \in S$. We assume a lexicographic 
ordering on $\Sigma_O(\Sigma_I\Sigma_O)^*$. 
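
As an illustration of the prefix definition, the following Python sketch represents an alternating string as a tuple of symbols; the type alias and the function names are assumptions of this sketch and do not come from the paper's implementation.

```python
from typing import Iterable, List, Set, Tuple

# An alternating string (sigma_0, alpha_1, sigma_1, ..., alpha_n, sigma_n).
IOString = Tuple[str, ...]

def prefixes(s: IOString) -> List[IOString]:
    """Return all prefixes of s that end in an output symbol (k = 0..n)."""
    if len(s) % 2 == 0:
        raise ValueError("an alternating string must have odd length")
    return [s[: 2 * k + 1] for k in range(len(s) // 2 + 1)]

def prefix_set(strings: Iterable[IOString]) -> Set[IOString]:
    """prefix(S): the union of the prefixes of every string in S."""
    return {p for s in strings for p in prefixes(s)}
```

Python compares tuples lexicographically, so `sorted(prefix_set(S))` gives one possible realization of the ordering assumed above.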

In a DLMDP there is a tight connection between strings and states: given an observed string $s$ there is 
a unique state $q$ that the LMDP must be in. Conversely, every state $q$ is associated with the set $strings(q)$ 
of all strings that lead from the start state to $q$. We therefore use the symbols $q$ for states and $s$ for strings 
to some extent interchangeably: $s$ can also denote the state in a DLMDP reached by the string $s$. The 
association of strings with states, on the other hand, is not one-to-one. We can still identify $q$ with the 
lexicographically minimal $s \in strings(q)$, and may also use $q$ to denote this string. 

2.3 Scheduler 

A scheduler [?] for an MDP $M$ is a function $\mathfrak{S} : Q^+ \to \Sigma_I$ such that $\mathfrak{S}(q_0 q_1 \ldots q_n) \in Act(q_n)$ for all 
$q_0 q_1 \ldots q_n \in Q^+$. The scheduler chooses in any state $q$ one action $\alpha \in \Sigma_I$, and induces a Markov chain, 
i.e., the behavior of an MDP $M$ under the decisions of scheduler $\mathfrak{S}$ can be formalized by a Markov chain 
$M_{\mathfrak{S}}$ [?, Section 10.6]. 

A labeled Markov chain (LMC) $M_{\mathfrak{S}}$ of an LMDP $M$ induced by a scheduler $\mathfrak{S}$ defines a probability 
measure $P_{M_{\mathfrak{S}}}$ on $(\Sigma_O)^\omega$ which is the basis for associating probabilities with events in the LMC $M_{\mathfrak{S}}$. The 
probability of a string $s = \sigma_0\sigma_1 \ldots \sigma_n$, $\sigma_i \in \Sigma_O$, defined by $M_{\mathfrak{S}}$ is: 

$$P_{M_{\mathfrak{S}}}(s) = \prod_{i=1}^{n} \tau_{\mathfrak{S}}(\sigma_0\sigma_1 \ldots \sigma_{i-1}, \sigma_i)$$

where $\tau_{\mathfrak{S}}$ is the transition probability function of $M_{\mathfrak{S}}$. 

2.4 Probabilistic LTL 

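Linear time temporal logic (LTL) over $\Sigma_O$ is defined as usual by the syntax 

$$\varphi ::= true \mid \sigma \mid \varphi_1 \wedge \varphi_2 \mid \neg\varphi \mid \bigcirc\varphi \mid \varphi_1 U \varphi_2, \qquad \sigma \in \Sigma_O$$

For better readability, we also use the derived temporal operators $\Box$ (always) and $\Diamond$ (eventually). 

Let $\varphi$ be an LTL formula. For $s = \sigma_0\sigma_1 \ldots \in (\Sigma_O)^\omega$, $s[j \ldots] = \sigma_j\sigma_{j+1}\sigma_{j+2} \ldots$ is the suffix of $s$ starting 
with the $j$th symbol $\sigma_j$. Then the LTL semantics for infinite words over $\Sigma_O$ are as follows: 

• $s \models true$ 

• $s \models \sigma$, iff $\sigma = \sigma_0$ 

• $s \models \varphi_1 \wedge \varphi_2$, iff $s \models \varphi_1$ and $s \models \varphi_2$ 

• $s \models \neg\varphi$, iff $s \not\models \varphi$ 

• $s \models \bigcirc\varphi$, iff $s[1 \ldots] \models \varphi$ 

• $s \models \varphi_1 U \varphi_2$, iff $\exists j \ge 0.\ s[j \ldots] \models \varphi_2$ and $s[i \ldots] \models \varphi_1$, for all $0 \le i < j$ 

The syntax of probabilistic LTL (PLTL) is: 

$$\phi ::= P_{\bowtie r}(\varphi) \qquad (\bowtie \in \{\ge, \le, =\};\ r \in [0,1];\ \varphi \in LTL)$$

A labeled Markov decision process $M$ satisfies the PLTL formula $P_{\bowtie r}(\varphi)$ iff $P_{M_{\mathfrak{S}}}(\varphi) \bowtie r$ for all 
schedulers $\mathfrak{S}$ of $M$, where $P_{M_{\mathfrak{S}}}$ is the probability distribution defined by the LMC induced by a scheduler 
$\mathfrak{S}$ of $M$, and $P_{M_{\mathfrak{S}}}(\varphi)$ is short for $P_{M_{\mathfrak{S}}}(\{s \mid s \models \varphi,\ s \in (\Sigma_O)^\omega\})$. 

The quantitative analysis of an MDP $M$ against a specification $\varphi$ amounts to establishing the lower and 
upper bounds that can be guaranteed when ranging over all schedulers. This corresponds to computing 

$$P^{max}_M(\varphi) = \sup_{\mathfrak{S}} P_{M_{\mathfrak{S}}}(\varphi) \quad \text{and} \quad P^{min}_M(\varphi) = \inf_{\mathfrak{S}} P_{M_{\mathfrak{S}}}(\varphi)$$

where the infimum and the supremum are taken over all schedulers for $M$. 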

3 Data Generation 

The data we learn from is generated by observing the running reactive system. From the system we 
can observe input actions, which determine probability distributions over successor states, and outputs, 
which are labels of successor states. The learning algorithm requires that all nondeterministic choices 
are resolved by a fair scheduler $\mathfrak{S}$, which means that each input action is chosen infinitely often. We 
assume that inputs and outputs are observed alternately, and that every observation sequence starts with 
the label of the initial state and ends in a state, i.e. it has the form $\sigma_0\alpha_1\sigma_1 \ldots \alpha_n\sigma_n$, with $\alpha_i \in \Sigma_I$ and $\sigma_i \in \Sigma_O$. 

Usually, the enabled and disabled actions for the states of a black-box system are unknown. Therefore, we 
allow all actions to be chosen in each state of the system. For enabled actions, the system 
makes a transition to a successor state, and the input and the corresponding label of the successor state are collected. 
For disabled actions, the system stays in the same state but gives a special error message. In 
this setting, enabled and disabled inputs can be distinguished. We denote the prompted 
error by err, so the output alphabet $\Sigma_O$ is extended to $\Sigma_O \cup \{err\}$. Because the scheduler is memoryless, 
the same disabled input may be chosen more than once in the same state, and the statistical information 
about err is needed in the compatibility test described in Section 4. 

After all nondeterministic choices have been resolved, let $S_1^\omega, S_2^\omega, \ldots$ be an independent family of 
$P_{M_{\mathfrak{S}}}$-distributed random variables (with values in $(\Sigma_O)^\omega$) and $L_1, L_2, \ldots$ be an independent family 
of integer-valued random variables, such that the $L_i$ are also independent of the $S_i^\omega$. We assume that we 
observe the finite observation sequences $S_i := \sigma_0\alpha_1\sigma_1 \ldots \alpha_{L_i}\sigma_{L_i}$, i.e., the first $L_i$ symbols of $S_i^\omega$. Thus, we 
observe independent runs of the system for a period of time that is determined independently of the 
observed behavior (in particular, the observation does not automatically end when the system enters a 
deadlock or failure state; such a situation would rather lead to repeated deadlock or failure observations 
in the final part of the sequence). We assume that the $L_i$ are unbounded, i.e. $P(L_i > k) > 0$ for all $k \in \mathbb{N}$. 
This will be satisfied by a geometric distribution for the $L_i$. For some models, there exists a uniquely 
labeled absorbing state which can be identified by its observation (e.g., a failure state from which the system cannot 
recover). When such prior knowledge is available, observations can be stopped when the model reaches that 
state. 

Finally, we denote by $S[n] = S_1, \ldots, S_n$ the sample consisting of the first $n$ observations. 
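
The data generation procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary encoding of the DLMDP (`trans`, `label`), the function name `sample_sequence` and the parameter `stop_prob` are hypothetical, the scheduler is taken to be uniform and memoryless, and the sequence length is geometric because sampling stops after each step with a fixed probability.

```python
import random

def sample_sequence(trans, label, init, inputs, rng, stop_prob=0.1):
    """Sample one alternating sequence sigma_0 alpha_1 sigma_1 ... from a DLMDP.

    trans[(q, a)] is a list of (probability, successor) pairs; a missing key
    means that input a is disabled in state q.
    """
    q = init
    seq = [label[q]]
    while True:
        a = rng.choice(inputs)               # uniform, memoryless scheduler
        if (q, a) not in trans:
            seq += [a, "err"]                # disabled input: stay in q, emit err
        else:
            probs, succs = zip(*trans[(q, a)])
            q = rng.choices(succs, weights=probs, k=1)[0]
            seq += [a, label[q]]
        if rng.random() < stop_prob:         # geometric length, unbounded support
            return tuple(seq)
```

Calling the function repeatedly with the same `random.Random` instance yields an independent sample $S[n]$; every sequence starts with the label of the initial state and ends with an output symbol.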

4 Learning 

IOALERGIA for learning DLMDPs consists of two phases. First, the data is represented as an I/O frequency 
prefix tree acceptor (IOFPTA), in which common prefixes are combined. Then, compatibility 
tests are performed on the nodes of the tree in lexicographic order. If two states are compatible, which requires that their next-state 
distributions given the same input are compatible, they and their successor states are merged 
correspondingly. 

4.1 IOFPTA 

The input and output frequency prefix tree acceptor (IOFPTA) is constructed as a representation of the 
set of strings $S$, which captures the behavior of the reactive system under observation. Since in a DLMDP 
the same sequence always leads to the same state, common prefixes are merged in the IOFPTA, 
resulting in a tree-shaped automaton. Each node in the tree is labeled by an output symbol $\sigma \in \Sigma_O$, and each 
edge is labeled by an input action $\alpha \in \Sigma_I$. Every path from the root to a node corresponds to a string 
$s \in prefix(S)$. The node $s$ is associated with the frequency function $f(s,\alpha,\sigma)$ ($\alpha \in \Sigma_I$, $\sigma \in \Sigma_O$), which 
is the number of strings in $S$ with the prefix $s\alpha\sigma$, and $f(s,\alpha) = \sum_{\sigma \in \Sigma_O} f(s,\alpha,\sigma)$. From a node in the 
IOFPTA, given an input action and an output symbol, the next state is uniquely determined. An 
IOFPTA can be transformed into a DLMDP by normalizing the frequencies $f(s,\alpha,\cdot)$ to $\tau(s,\alpha,\cdot)$. As assumed 
in the data generation phase, when the scheduler chooses a disabled input in a state of the LMDP, the model 
stays in the current state and outputs the symbol err. We take this special meaning of the 
err symbol into account in the IOFPTA construction: $s$ and $s\,\alpha\,err$ lead to the same 
state from the root state. Apart from this special treatment in the construction, err is handled like 
any other symbol during learning. No new node labeled by err is created as a successor 
node; in other words, the err nodes are folded up. 
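
The construction can be sketched as follows. This is a minimal illustration, assuming that nodes are identified by the err-folded string that reaches them and that frequencies are accumulated while walking each observed sequence; the names `build_iofpta`, `f` and `normalize` are hypothetical and not taken from the Matlab implementation.

```python
from collections import defaultdict

def build_iofpta(sample):
    """Collect IOFPTA frequency counts from alternating I/O strings.

    freq[node][(a, o)] counts how often input a followed by output o
    was observed at node, where node is a tuple of symbols.
    """
    freq = defaultdict(lambda: defaultdict(int))
    for s in sample:
        node = (s[0],)                       # root, labeled by the initial output
        for i in range(1, len(s), 2):
            a, o = s[i], s[i + 1]
            freq[node][(a, o)] += 1
            if o != "err":                   # err is folded: remain in the same node
                node = node + (a, o)
    return freq

def f(freq, node, a, o=None):
    """f(s, a, o), or f(s, a) = sum over o of f(s, a, o) when o is None."""
    counts = freq.get(node, {})
    if o is None:
        return sum(c for (a2, _), c in counts.items() if a2 == a)
    return counts.get((a, o), 0)

def normalize(freq):
    """Turn frequencies f(s, a, .) into empirical probabilities tau(s, a, .)."""
    return {node: {(a, o): c / f(freq, node, a) for (a, o), c in counts.items()}
            for node, counts in freq.items()}
```

For instance, the sample `[("A","a","B"), ("A","a","B"), ("A","a","C"), ("A","b","err","a","B")]` gives $f(A,a,B) = 3$, $f(A,a) = 4$ and one err observation for input `b` at the root.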

Example 1 IOFPTA 




Figure 1: (a) A DLMDP over $\Sigma_O = \{A,B,C,err\}$ and $\Sigma_I = \{\alpha,\beta\}$; (b) The corresponding IOFPTA. 

The IOFPTA in Fig. 1(b) is constructed from sample sequences generated by the DLMDP $M$ in Fig. 1(a). 
The root node is labeled by A. From the root, given input $\alpha$, the successor nodes labeled by B and 
C are reached by strings with the prefix $A\alpha B$ or $A\alpha C$, respectively. For the frequencies, $f(A,\alpha,B) = 15$ 
and $f(A,\alpha,C) = 1$. The input action $\beta$ is disabled in the state with label C in (a). The tree therefore stays 
in the node labeled by C when the input $\beta$ is encountered (drawn with dashed lines in Fig. 1(b)). 
Consequently, the incoming frequency of a node is not necessarily equal to its outgoing frequencies. 

4.2 IOalergia 

The IOALERGIA algorithm is an adapted version of the ALERGIA algorithm [5]. As seen in Example 1, 
the same state in the generating LMDP can be reached by more than one sequence, 
which creates more than one node in the IOFPTA. The basic idea of the learning algorithm is to 
approximate the generating model by grouping together the nodes in the IOFPTA which can be mapped 
to the same state in the generating model. The partition induced by grouping nodes is 
inferred by pairwise testing. The compatibility of two nodes is tested by comparing the distributions defined 
by the nondeterministic choices, and by recursively testing the successor nodes. If two nodes in the tree pass the 
compatibility test, which means that they can be mapped to the same state in the generating model, then they 
are merged, as are their successor nodes. 
Algorithm 1 IOALERGIA 
Input: A dataset $S$ and a parameter $\epsilon \in (0, 1]$ 
Output: A DLMDP $A$ 

1: $T, A \leftarrow$ IOFPTA($S$); 
2: RED $\leftarrow \{q^s\}$; 
3: BLUE $\leftarrow \{q \mid q = q^s\alpha\sigma,\ \alpha \in \Sigma_I,\ \sigma \in \Sigma_O,\ q \in prefix(S)\}$; /* immediate successor states */ 
4: while BLUE $\ne \emptyset$ do 
5:     $q_b \leftarrow$ lexicographically minimal $q \in$ BLUE; 
6:     merged $\leftarrow$ false; 
7:     for $q_r \in$ RED /* in lexicographic order */ do 
8:         if Compatible($T, q_r, q_b, \epsilon$) then 
9:             $A \leftarrow$ Merge($A, q_r, q_b$); 
10:            merged $\leftarrow$ true; 
11:        end if 
12:    end for 
13:    if $\neg$merged then 
14:        RED $\leftarrow$ RED $\cup\ \{q_b\}$; 
15:    end if 
16:    BLUE $\leftarrow$ (BLUE $\setminus \{q_b\}$) $\cup\ \{q \mid q = q_r\alpha\sigma,\ \alpha \in \Sigma_I,\ \sigma \in \Sigma_O,\ q \in prefix(S),\ q_r \in$ RED$,\ q \notin$ RED$\}$; 
17: end while 
18: return makeDLMDP($A$); /* normalize */ 

In the learning algorithm, two IOFPTAs $T$ and $A$ are first constructed as representations of 
the dataset $S$ (line 1 of Algorithm 1). The IOFPTA $T$ is kept as a data representation from which 
relevant statistics are retrieved during the execution of the algorithm. The IOFPTA $A$ is iteratively 
transformed by merging nodes which have passed the compatibility test. All compatibility tests are performed on 
$T$, because its frequencies have a clear interpretation as empirical probabilities defined by the 
data $S$. Following the terminology from [8], Algorithm 1 maintains two sets of states: RED states, 
which have already been determined as representative states of partitions and will be included in the final 
output DLMDP, and BLUE states, which are still to be tested. Initially, RED contains only the initial 
state while BLUE contains the immediate successor states of the initial state. In each iteration, the 
lexicographically minimal node $q_b$ in BLUE is chosen. If there exists a state $q_r$ in RED which is 
compatible with $q_b$, then $q_b$ and its successor nodes are merged into $q_r$ and its corresponding 
successor states. If $q_b$ is not compatible with any state in RED, it is added to RED. At the end 
of each iteration, BLUE is updated to be the margin between RED and the remaining states, 
in other words, the set of states which are immediate successors of RED states but are not themselves in RED. 
After all possible compatible nodes in the tree have been merged, the frequencies in $A$ are normalized 
(line 18 of Algorithm 1), which yields a DLMDP. 
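
The RED/BLUE bookkeeping described above can be sketched independently of the statistical test. The following Python skeleton is an illustration only: `red_blue_loop` and its callback parameters are hypothetical names, nodes are assumed to be orderable values such as strings, BLUE is recomputed from the successors of RED in every iteration instead of being updated incrementally, and, unlike the listing of Algorithm 1, the inner loop stops at the first compatible RED state.

```python
def red_blue_loop(root, successors, compatible, merge):
    """Run the RED/BLUE state-merging loop and return the list of RED states.

    successors(q)        -> iterable of the immediate successor nodes of q
    compatible(q_r, q_b) -> bool, the compatibility test
    merge(q_r, q_b)      -> None; must redirect the edge into q_b towards q_r
    """
    red = [root]
    blue = {q for q in successors(root) if q not in red}
    while blue:
        q_b = min(blue)                      # lexicographically minimal BLUE node
        merged = False
        for q_r in sorted(red):              # RED states in lexicographic order
            if compatible(q_r, q_b):
                merge(q_r, q_b)
                merged = True
                break                        # assumption: stop at the first match
        if not merged:
            red.append(q_b)                  # q_b becomes a representative state
        # BLUE: immediate successors of RED states that are not RED themselves.
        blue = {q for r in red for q in successors(r) if q not in red}
    return red
```

The loop terminates because every iteration either promotes a node to RED or removes a node from the reachable part of the tree by merging it.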

4.3 Compatibility Test 

Algorithm 2 describes the compatibility test. It returns true if two nodes are compatible, i.e., the 
distance between the distributions for every action is within the Hoeffding bound [?]. Algorithm 2 is parameterized 
by $\epsilon$. Formally, two nodes $q_r$ and $q_b$ are $\epsilon$-compatible ($1 \ge \epsilon > 0$) if it holds that: 

1. $L(q_r) = L(q_b)$ 

2. Hoeffding($f(q_r,\alpha,\sigma)$, $f(q_r,\alpha)$, $f(q_b,\alpha,\sigma)$, $f(q_b,\alpha)$, $\epsilon$) is TRUE, for all $\alpha \in \Sigma_I$ and $\sigma \in \Sigma_O$ 

3. Nodes $q_r\alpha\sigma$ and $q_b\alpha\sigma$ are $\epsilon$-compatible, for all $\alpha \in \Sigma_I$ and $\sigma \in \Sigma_O$ 

Condition 1) requires two nodes in the tree to have the same label. Condition 2) defines the compatibility 
between the outgoing transitions with the same input action from states $q_r$ and $q_b$, respectively. The last 
condition requires compatibility to hold recursively for every pair of successors of $q_r$ and 
$q_b$. If two nodes in the IOFPTA are compatible, then the distributions for all input actions must pass the 
compatibility test. 

In the original ALERGIA algorithm, the termination probabilities of two nodes are compared, whereas they are not in 
Algorithm 2. The reason is that termination probabilities are not part of the definition of a DLMDP. In 
Algorithm 2, the distance between two empirical probabilities is compared with the Hoeffding bound. If there 
is little or no statistical evidence to support a difference, the distance is small relative to the bound. In particular, two 
nodes are compatible if there is no evidence against it. The err information is used to discriminate between 
nodes which have different enabled actions. For example, consider nodes $q_1$ and $q_2$, where input action $\alpha$ is 
enabled only in $q_1$. For $q_1$, $f(q_1,\alpha,\sigma) > 0$ for some $\sigma \ne err$ and $f(q_1,\alpha,err) = 0$, while $f(q_2,\alpha) = f(q_2,\alpha,err) > 0$. 
Comparing the empirical probability distributions over $\Sigma_O$ including err, $q_1$ and $q_2$ cannot be compatible. 

4.4 Merge states 

If two states $q_r$ and $q_b$ are compatible, $q_b$ is merged into $q_r$. The Merge procedure (line 9 of 
Algorithm 1) follows the approach described in [?]: first, the (unique) transition leading to $q_b$ from 
its predecessor node $q'$ ($f^A(q',\alpha,q_b) > 0$) is redirected to $q_r$ by setting $f^A(q',\alpha,q_r) \leftarrow f^A(q',\alpha,q_b)$ and 
$f^A(q',\alpha,q_b) \leftarrow 0$. Then, the successor nodes of $q_b$ are recursively folded into the corresponding successor 
nodes of $q_r$. 
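
A possible realization of the redirect-and-fold step is sketched below in Python. The dictionary-based node layout and the names `make_node`, `fold` and `merge` are assumptions made for this illustration; since transitions are keyed by (input, output) pairs and merged nodes carry the same label, redirecting the transition amounts to replacing the child stored under the same key.

```python
def make_node(label):
    """A tree node: label, outgoing frequencies and children keyed by (input, output)."""
    return {"label": label, "freq": {}, "children": {}}

def fold(q_r, q_b):
    """Add the counts of the subtree rooted at q_b into q_r, recursively."""
    for key, count in q_b["freq"].items():
        q_r["freq"][key] = q_r["freq"].get(key, 0) + count
    for key, child in q_b["children"].items():
        if key in q_r["children"]:
            fold(q_r["children"][key], child)
        else:
            q_r["children"][key] = child     # q_r has no such successor: adopt it

def merge(parent, key, q_r):
    """Redirect the transition parent --key--> q_b to q_r, then fold q_b into q_r."""
    q_b = parent["children"][key]
    parent["children"][key] = q_r
    fold(q_r, q_b)
```

The recursion always descends along the subtree of $q_b$, which is still tree-shaped when the merge happens, so it terminates even if the part of the automaton below $q_r$ already contains cycles.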

Example 2 Merge States 

Fig. 2 shows the procedure by which the node $q_b$ (shaded) is merged into the node $q_r$ (shaded 
double circle). In (a), the transition from the node $q'$ to $q_b$ is first redirected to $q_r$. In (b), the transitions from 
$q_b$ to its three successor nodes, labeled A, B and C, are folded into the corresponding successor 
nodes of $q_r$. (c) illustrates the result after the merge. 
Algorithm 2 Compatible 
Input: IOFPTA $T$, nodes $q_r$ and $q_b$, $\epsilon \in (0, 1]$ 
Output: true if $q_r$ and $q_b$ are compatible 

1: if $L(q_r) \ne L(q_b)$ then 
2:     return false 
3: end if 
4: for $\alpha \in \Sigma_I$ do 
5:     for $\sigma \in \Sigma_O$ do 
6:         if $\neg$Hoeffding($f^T(q_r,\alpha,\sigma)$, $f^T(q_r,\alpha)$, $f^T(q_b,\alpha,\sigma)$, $f^T(q_b,\alpha)$, $\epsilon$) then 
7:             return false 
8:         end if 
9:         if $\neg$Compatible($T$, $q_r\alpha\sigma$, $q_b\alpha\sigma$, $\epsilon$) then 
10:            return false 
11:        end if 
12:    end for 
13: end for 
14: return true 



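Algorithm 3 Hoeffding 
Input: $f_1, n_1, f_2, n_2$, $\epsilon \in (0,1]$ 
Output: true if $f_1/n_1$ and $f_2/n_2$ are sufficiently close 

1: if $n_1 == 0$ or $n_2 == 0$ then 
2:     return true 
3: end if 
4: return $\left|\frac{f_1}{n_1} - \frac{f_2}{n_2}\right| < \left(\sqrt{\frac{1}{n_1}} + \sqrt{\frac{1}{n_2}}\right) \cdot \sqrt{\frac{1}{2}\ln\frac{2}{\epsilon}}$ 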
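
The two tests can be sketched in Python as follows. This is an illustration under assumptions: the Hoeffding test uses the bound known from the ALERGIA algorithm, $\sqrt{\frac{1}{2}\ln\frac{2}{\epsilon}}\,(1/\sqrt{n_1} + 1/\sqrt{n_2})$, nodes are plain dictionaries with hypothetical fields `label`, `freq` and `children`, and the recursion is only followed where both successor nodes exist, since a missing node has zero counts and is trivially accepted.

```python
import math

def hoeffding(f1, n1, f2, n2, eps):
    """True if the empirical frequencies f1/n1 and f2/n2 are sufficiently close."""
    if n1 == 0 or n2 == 0:
        return True                          # no evidence against compatibility
    bound = math.sqrt(0.5 * math.log(2.0 / eps)) * (1 / math.sqrt(n1) + 1 / math.sqrt(n2))
    return abs(f1 / n1 - f2 / n2) < bound

def compatible(q_r, q_b, eps, inputs, outputs):
    """Recursive epsilon-compatibility test; outputs should include 'err'."""
    if q_r["label"] != q_b["label"]:
        return False
    for a in inputs:
        n1 = sum(q_r["freq"].get((a, o), 0) for o in outputs)   # f(q_r, a)
        n2 = sum(q_b["freq"].get((a, o), 0) for o in outputs)   # f(q_b, a)
        for o in outputs:
            f1 = q_r["freq"].get((a, o), 0)
            f2 = q_b["freq"].get((a, o), 0)
            if not hoeffding(f1, n1, f2, n2, eps):
                return False
            c_r = q_r["children"].get((a, o))
            c_b = q_b["children"].get((a, o))
            if c_r is not None and c_b is not None:
                if not compatible(c_r, c_b, eps, inputs, outputs):
                    return False
    return True
```

Including err in `outputs` is what lets the test separate two nodes with different enabled inputs: a node that only ever answers an input with err has an empirical distribution far from that of a node where the input is enabled.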





4.5 Discussion 



The algorithm takes a set $S$ of sample sequences and a parameter $\epsilon$ as inputs. Here $\epsilon$ is used to bound 
the type-I error, which is the probability of wrongly rejecting a correct compatibility hypothesis. Smaller 
values of $\epsilon$ lead to looser Hoeffding bounds, so that more nodes are merged and IOALERGIA outputs a smaller model. For any 
particular finite sample size we tune the choice of $\epsilon$ so as to obtain the best approximation to the 
real model. To do this, we run IOALERGIA with different $\epsilon$ values and evaluate the learned 
models using the Bayesian Information Criterion (BIC) score. This score combines the likelihood of a 
model with a term penalizing model complexity. Concretely, the BIC score of a DLMDP $A$ given data $S$ 
is defined as 

$$BIC(A \mid S) := \log(P_A(S)) - \frac{1}{2}\,|A|\,\log(N)$$

where $|A| = |Q| \cdot |\Sigma_I| \cdot |\Sigma_O|$ is the number of free parameters in the model and $N$ is the number of symbols 
in the data. Using a golden section search [?, Section E.1.1] we systematically search for an $\epsilon$ value 
maximizing the BIC score of the learned model. Our algorithm is implemented in Matlab and is available 
for download at http://mi.cs.aau.dk/code/ioalergia 
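
The model selection step can be illustrated with the following Python sketch, which is not the Matlab implementation. The function names `bic` and `golden_section_max` are hypothetical, the log-likelihood is assumed to be computed elsewhere, and the search is a textbook golden section search that presumes the score is unimodal on the search interval.

```python
import math

def bic(log_likelihood, n_states, n_inputs, n_outputs, n_symbols):
    """BIC(A | S) = log P_A(S) - 1/2 * |A| * log N with |A| = |Q| * |Sigma_I| * |Sigma_O|."""
    size = n_states * n_inputs * n_outputs
    return log_likelihood - 0.5 * size * math.log(n_symbols)

def golden_section_max(score, lo, hi, tol=1e-4):
    """Locate a maximizer of score on [lo, hi], assuming score is unimodal there."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = score(c), score(d)
    while hi - lo > tol:
        if fc > fd:                          # the maximum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = score(c)
        else:                                # the maximum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = score(d)
    return (lo + hi) / 2
```

In the setting of this section, `score` would map an $\epsilon$ value to the BIC score of the model learned with that value. Because the learned model changes only at discrete $\epsilon$ thresholds, the score is piecewise constant, which is consistent with reporting an interval of BIC-optimal $\epsilon$ values.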

A convergence analysis, similar to the analysis in [?] for deterministic Markov chain models, can 
be obtained for IOALERGIA: first, one can show that in the large sample limit, IOALERGIA will identify 
up to bisimulation equivalence the structure of the true model from which the data was sampled; the 
structure of a model refers to all of its components, except the probability values of transitions. Second, 
the parameters in the learned model will converge to the corresponding parameter values in the true 
model. As a slight refinement of Theorem 2 in [13], one then obtains that for any LTL formula $\varphi$: 

$$P\left(\lim_{n\to\infty} P^{max}_{A_n}(\varphi) = P^{max}_M(\varphi)\right) = 1, \quad \text{and} \quad P\left(\lim_{n\to\infty} P^{min}_{A_n}(\varphi) = P^{min}_M(\varphi)\right) = 1$$

where $A_n$ is the DLMDP returned by IOALERGIA on data $S[n]$. As also observed in [13], similar results 
do not carry over to PCTL formulas. 



5 Experiments 

In this section, we show the applicability of the IOALERGIA algorithm using a case study 
based on the slot machine [19]. The slot machine we consider has 3 reels, named reel-1, reel-2 
and reel-3, and each reel contains 5 different symbols: lemon, grape, cherry, bar, and apple. The slot 
machine returns a prize based on the combination of symbols on the 3 reels. The prizes for the different 
configurations are shown in Table 1(a). We extend the basic gambling machine as follows: in each round 
the player can choose one of the reels to spin, while the other reels are kept. The player starts by paying 
1 coin for the first 3 spins, and afterwards each extra spin costs 1 additional coin. Each reel must be spun 
at least once, and the player can quit the game only if all reels have been spun. The behavior of the 
slot machine contains both probabilistic and nondeterministic aspects. Specifically, the symbol shown on 
each reel is probabilistic, but the choice of which reel to spin is nondeterministic. 

In the following parts of this section, the algorithm is applied to learning models of deterministic and 
nondeterministic systems for different numbers of spins. A memoryless random scheduler with 
a uniform distribution over all input actions, which meets the fairness requirement, is used in the data 
generation procedure. In the experiments, we analyze the behavior of the learned models by comparing them 
with the known generating models in terms of the maximal and minimal probabilities of winning a specific 
reward as well as the maximal expected reward in general. These probabilities and rewards are all 
computed by PRISM [10]. We also analyze the accuracy of the optimal actions in the learned 
model, given the symbols on the reels and the number of times the reels have been spun. 






Learning Markov Decision Processes for Model Checking 



Table 1: Prize 
Table 2: Summary of slot machines 

        Deterministic Slot Machine      Nondeterministic Slot Machine 
N       |Q|        |Tran|               |Q|        |Tran| 
4       437        4021                 510        4959 
6       867        10721                1012       13291 
8       1297       17421                1514       21623 
10      1727       24121                2016       29955 



5.1 Learning models from Deterministic systems 

We implemented the slot machine in PRISM. The distributions for the 3 reels showing the different symbols 
are (0.2,0.2,0.1,0.3,0.2), (0.2,0.1,0.3,0.2,0.2), and (0.2,0.3,0.2,0.1,0.2), respectively. In this model, 
there are 4 actions: spin reel-1 ($sp_1$), spin reel-2 ($sp_2$), spin reel-3 ($sp_3$), and get the prize ($pay$), thus 
$\Sigma_I = \{sp_1, sp_2, sp_3, pay\}$. Every state is labeled by the combination of the states of the 3 reels and the 
number of times the reels have been spun. We also attached reward variables to the states which are 
labeled by a prize. Table 2 shows statistics for models with various numbers of spins. Here, $N$ ($N \ge 3$) is 
the number of spins, $|Q|$ is the number of states, and $|Tran|$ is the number of transitions. 



The generating model is a deterministic LMDP. The results of applying the learning algorithm to 
different data sets produced by the generating model are summarized in Table 3: $|S|$ is the number of 
symbols in the dataset ($\times 10^3$); $|Seq|$ is the number of sequences in the dataset; $|IOFPTA|$ is the number 
of nodes in the IOFPTA; Time is the learning time (in seconds), including the time for constructing the 
IOFPTA and the average time for each iteration performed by the golden section search (typically the 
golden section search terminated after 14 to 19 iterations); '$\epsilon$ range' is the interval (identified using the 
golden section search) of $\epsilon$ values for which a BIC-optimal DLMDP is learned; $|Q|$ is the number of states in 
the learned model. 

Fig. 3(a) and (b) show the maximal and minimal probabilities of eventually getting different prizes, 
computed using the properties $P^{max}(\Diamond\, L\ coins)$ and $P^{min}(\Diamond\, L\ coins)$, where $L \in \{0, 1, 2, 5, 10\}$ (on both the generating model and 
the learned models, for $N = 4, 6, 8, 10$). As the size of the dataset increases, the learned models provide better 
approximations of the maximal and minimal probabilities defined for the generating models. Using 
PRISM, the maximal expected reward for one gamble ($R^{max}(\Diamond\, stop)$) can be computed. In Fig. 3(c), 
for various numbers of initially bought spins, the maximal expected rewards for the learned models (dashed 
lines) all approach those of the generating models as the sizes of the datasets increase. 

The optimal action, i.e. which reel to spin next for a specific configuration of the reels, can also be 
accurately preserved by the learned models. For example, given that there are three apples on the reels and we 
only have 1 spin left, the best choice is to spin the 3rd reel since taking any other action will not produce 
a prize. We consider the 125 configurations where every reel has been spun once. Given a specific 
configuration $C_i$, the optimal actions in the learned model and the generating model are denoted by $Act_i^l$ 
and $Act_i^g$, respectively. We define a criterion which measures the accuracy of the optimal actions inferred by 
the learned model against the generating model as follows: 

$$Acc = \sum_i P^{max}(C_i) \cdot \frac{|Act_i^l \cap Act_i^g|}{|Act_i^g|}$$
Table 3: Experimental results for the slot machine models. 

N       |S| (x10^3)   |Seq|     |IOFPTA|   Time (s)   ε range               |Q| 
4       160           5832      20915      9.8        [0.0020; 0.1552]      436 
4       640           23246     48373      29.9       [0.0020; 0.1552]      437 
4       1280          46374     64064      50.2       [0.0020; 0.1250]      437 
6       160           5779      33829      16.0       [0.0020; 0.1553]      866 
6       640           23154     122458     46.9       [0.0020; 0.1553]      867 
6       1280          46273     231029     84.9       [0.0010; 0.0776]      867 
8       640           23054     148225     66.1       [0.0020; 0.1553]      1297 
8       1280          46242     283749     116.6      [0.0010; 0.0776]      1297 
8       2000          72284     429555     153.0      [0.0010; 0.0776]      1297 
10      1280          46241     317794     142.5      [0.0005; 0.0388]      1725 
10      2000          72250     482943     184.0      [0.0005; 0.0313]      1727 
10      5000          180755    1135055    454.4      [0.00006; 0.0040]     1727 




where $P^{max}(C_i)$ is the maximal probability of reaching configuration $C_i$. As shown in Fig. 3(d), as 
the size of the dataset increases, the learned models have almost the same optimal actions as the generating 
models. Even with a very limited amount of data, the accuracies of the optimal actions in the learned models are always 
greater than 25%, which is the probability of randomly choosing an optimal action. 
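
The accuracy criterion can be computed as in the following Python sketch. It assumes that the criterion is the $P^{max}$-weighted overlap between the optimal action sets of the learned and the generating model, summed over all considered configurations; the function name and the list-based inputs are hypothetical.

```python
def optimal_action_accuracy(p_max, act_learned, act_generating):
    """Weighted share of optimal actions that the learned model gets right.

    p_max[i]          -- maximal probability of reaching configuration C_i
    act_learned[i]    -- set of optimal actions for C_i in the learned model
    act_generating[i] -- set of optimal actions for C_i in the generating model
    """
    return sum(p * len(l & g) / len(g)
               for p, l, g in zip(p_max, act_learned, act_generating))
```

For two equally likely configurations in which the learned model agrees with the generating model on only one, the function returns 0.5.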



5.2 Learning models from Nondeterministic systems 



In order to make the slot machine more interesting, we increase the prize for three bars but reduce the 
probability of getting them. This is done by adding another bar to reel-2, so that there are two bars, denoted $b_1$ and $b_2$, 
that are indistinguishable but have different mechanical characteristics. The probabilities for these two 
bars depend on the symbols on the other two reels. 

The distributions for all reels are shown in Table 4(a) and Table 4(b). Since the reels are no longer 
independent, we refer to this machine as the hooked slot machine. In this machine, the probability of 
getting 3 bars is decreased, but the reward for getting 3 bars is 20 coins. Every other configuration has the 
same prize as in the previous game. After this modification, the generating model becomes nondeterministic, 
and its statistics are listed in Table 2. 

Table 4: Probability distributions for 3 reels

(a) Probability distributions for the 1st and the 3rd reel

reel   condition     lemon   grape   cherry   bar    apple
r_1    r_2 = b_1     0.2     0.2     0.1      0.3    0.2
r_1    r_2 = b_2     0.3     0.2     0.1      0.05   0.35
r_1    other         0.25    0.2     0.1      0.15   0.3
r_3    r_2 = b_1     0.2     0.3     0.2      0.05   0.25
r_3    r_2 = b_2     0.1     0.3     0.2      0.3    0.1
r_3    other         0.2     0.3     0.2      0.15   0.15

(b) Probability distributions for 2nd reel

symbol   r_1 = b   r_3 = b   r_1, r_3 = b   other
lemon    0.2       0.2       0.26           0.2
grape    0.1       0.1       0.1            0.1
cherry   0.3       0.3       0.3            0.3
bar 1    0.18      0.02      0.02           0.1
bar 2    0.02      0.18      0.02           0.1
apple    0.2       0.2       0.3            0.2

[Figure 3, panels (c) and (d). Legend: real and learned models for N = 4, 6, 8, 10. Horizontal axis:
number of symbols (x10^3).]

Figure 3: Evaluation results for learning deterministic models. (a) and (b): the maximal and
minimal probabilities of eventually being awarded L coins given 4, 6, 8, and 10 initial spins, where
$L \in \{0, 1, 2, 5, 10\}$. As shown, the model for N = 4 is learned from $1280 \times 10^3$ symbols, and
the models for N = 6, 8, 10 are all learned from $5000 \times 10^3$ symbols. (c): maximal rewards
($\mathcal{R}_{\max}(\Diamond\,\mathit{stop})$) in the learned models and the generating model. (d): the
accuracy of the optimal action in the learned models.

In this experiment, we apply IOalergia for learning DLMDPs from data generated by the nondeterministic
models. The learning results are summarized in Table 5, where each column has the same meaning as
in Table 3. Given sufficient data, we observed that the learned models have the same number of states as
the deterministic models of the previous slot machine; thus the states introduced by the extra symbol on
reel 2 were not identified. The reason is that the states labeled $b_1$ and $b_2$ on reel 2 are merged,
since both are observed simply as bar on reel 2.
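The merging effect can be illustrated with a minimal sketch. It assumes, purely for illustration, that
reel 2 is spun first (so that the "other" column of Table 4(b) applies) and that reel 3 is spun next. The
names `observe`, `REEL2_OTHER`, and `REEL3_BAR_GIVEN` are hypothetical and are not part of IOalergia.
Because both bars are reported as bar, an observer sees only the sum of their probabilities on reel 2, and
only a mixture of the two conditional distributions on reel 3.

```python
# Table 4(b), column "other": distribution of reel 2 (assumed to apply when reel 2 is spun first).
REEL2_OTHER = {"lemon": 0.2, "grape": 0.1, "cherry": 0.3,
               "bar1": 0.1, "bar2": 0.1, "apple": 0.2}

# Table 4(a), reel 3: probability of a bar given which bar reel 2 hides.
REEL3_BAR_GIVEN = {"bar1": 0.05, "bar2": 0.3}


def observe(symbol):
    """Map a hidden reel-2 symbol to the label an observer sees."""
    return "bar" if symbol in ("bar1", "bar2") else symbol


def observed_distribution(hidden):
    """Push a hidden distribution through the observation function."""
    seen = {}
    for symbol, p in hidden.items():
        label = observe(symbol)
        seen[label] = seen.get(label, 0.0) + p
    return seen


def mixed_bar_probability():
    """P(bar on reel 3 | 'bar' observed on reel 2), averaged over the hidden bar."""
    p_bar = REEL2_OTHER["bar1"] + REEL2_OTHER["bar2"]
    return sum(REEL2_OTHER[b] / p_bar * REEL3_BAR_GIVEN[b] for b in ("bar1", "bar2"))


print(observed_distribution(REEL2_OTHER))
print(mixed_bar_probability())
```

Under these assumptions the mixture lies strictly between the two hidden values 0.05 and 0.3, so a model
with a single bar state on reel 2 cannot reproduce either of them. The values actually learned also depend
on the order of spins in the data, which this sketch does not model.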

Fig. 4 shows the maximal and minimal probabilities of getting different prizes, the maximal rewards from
the initial state, and the accuracy of the optimal action. Given adequate data, the learned deterministic
models provide good approximations of the nondeterministic generating models in terms of maximal
probability, minimal probability, and maximal expected reward. On the other hand, the accuracy of choosing
the optimal action in the next step is no longer as good as before. Nevertheless, the suggestion given by
the learned model is still better than a random choice (which has 25% accuracy) in most cases.
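For readers unfamiliar with maximal and minimal reachability probabilities, one standard way to compute
them for a finite MDP is value iteration. The sketch below is an illustration on a hypothetical toy model
(`TOY_MDP` and its numbers are invented for this example); it is not the tool chain used for the
experiments reported here.

```python
def reachability(mdp, targets, maximize=True, iterations=1000, tol=1e-12):
    """Value iteration for the max or min probability of eventually reaching `targets`.

    `mdp` maps state -> {action -> [(probability, successor), ...]}.
    States without actions are treated as absorbing.
    """
    value = {s: (1.0 if s in targets else 0.0) for s in mdp}
    pick = max if maximize else min
    for _ in range(iterations):
        delta = 0.0
        for s, actions in mdp.items():
            if s in targets or not actions:
                continue
            # Best (or worst) expected value over the available actions.
            new = pick(sum(p * value[t] for p, t in succ) for succ in actions.values())
            delta = max(delta, abs(new - value[s]))
            value[s] = new
        if delta < tol:
            break
    return value


# Hypothetical toy model: spinning may repeat, win, or lose; stopping always loses.
TOY_MDP = {
    "start": {
        "spin": [(0.5, "start"), (0.2, "win"), (0.3, "lose")],
        "stop": [(1.0, "lose")],
    },
    "win": {},
    "lose": {},
}

p_max = reachability(TOY_MDP, {"win"}, maximize=True)
p_min = reachability(TOY_MDP, {"win"}, maximize=False)
print(p_max["start"], p_min["start"])
```

In the toy model the maximizing scheduler always spins, so the maximal probability solves
x = 0.5x + 0.2, while the minimizing scheduler stops immediately and never wins.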

Table 5: Experimental results for hooked slot machines.

N       |S| (x10^3)   |Seq|     Tree       time    e range             |Q|
N = 4   160           5794      20768      9.7     [0.0020; 0.1552]    437
N = 4   640           23185     48530      29.8    [0.0020; 0.1250]    437
N = 4   1280          46308     64354      51.5    [0.0010; 0.0776]    437
N = 6   160           5737      33755      15.7    [0.0039; 0.2500]    867
N = 6   640           23174     122575     46.6    [0.0020; 0.1552]    867
N = 6   1280          46380     231260     84.0    [0.0005; 0.0388]    867
N = 8   640           23143     148730     63.7    [0.0020; 0.1552]    1297
N = 8   1280          46260     284310     112.7   [0.0010; 0.0776]    1297
N = 8   2000          72212     430102     166.1   [0.0005; 0.0313]    1297
N = 10  1280          46371     318423     138.6   [0.0010; 0.0776]    1723
N = 10  2000          72360     483696     202.5   [0.0005; 0.0313]    1724
N = 10  5000          180781    1135149    460.3   [0.0010; 0.0625]    1725

The generating model is a nondeterministic LMDP, so there is no guarantee that the learned model
preserves all PLTL properties. For example, suppose there are two bars after two spins, corresponding
to the configurations 'bar, bar, not-spun' ($C_1$), 'bar, not-spun, bar' ($C_2$), and 'not-spun, bar,
bar' ($C_3$). From these configurations, we can calculate the maximal probability of getting 3 bars after
the next spin (see Table 6). The maximal probabilities in the generating model are the same for all N,
since there is still one reel that has not been spun. We can observe that the conditional probabilities
in the learned models are quite different from the ones in the generating models.



Table 6: Conditional probability

                      real    N=4      N=6      N=8      N=10
P(3 x bars | C_1)     0.30    0.0714   0.0356   0.0327   0.0450
P(3 x bars | C_2)     0.04    0.0551   0.0659   0.0934   0.0701
P(3 x bars | C_3)     0.30    0.0940   0.0835   0.0874   0.0885
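The "real" column of Table 6 can be reproduced from Table 4 under one reading of the setup, sketched here
as an assumption rather than as the authors' computation: "three bars" counts either of the two bars on
reel 2, and when reel 2 already shows a bar the maximum ranges over which of the two indistinguishable
bars it hides. The dictionary and function names are illustrative only.

```python
# Table 4(a), "bar" column: probability of a bar on reel 1 / reel 3,
# given which of the two bars reel 2 hides.
P_BAR_REEL1 = {"b1": 0.3, "b2": 0.05}
P_BAR_REEL3 = {"b1": 0.05, "b2": 0.3}

# Table 4(b), column "r_1, r_3 = b": reel 2 when both other reels show a bar.
P_REEL2_BOTH_BARS = {"lemon": 0.26, "grape": 0.1, "cherry": 0.3,
                     "bar1": 0.02, "bar2": 0.02, "apple": 0.3}


def max_three_bars():
    """Maximal probability of completing three bars from C_1, C_2 and C_3."""
    # C_1 = 'bar, bar, not-spun': spin reel 3, maximize over the hidden bar on reel 2.
    c1 = max(P_BAR_REEL3.values())
    # C_2 = 'bar, not-spun, bar': spin reel 2, either bar completes the line.
    c2 = P_REEL2_BOTH_BARS["bar1"] + P_REEL2_BOTH_BARS["bar2"]
    # C_3 = 'not-spun, bar, bar': spin reel 1, maximize over the hidden bar on reel 2.
    c3 = max(P_BAR_REEL1.values())
    return {"C1": c1, "C2": c2, "C3": c3}


print(max_three_bars())
```

With this reading the three values agree with the "real" column of Table 6 (0.30, 0.04, 0.30).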



Figure 4: Evaluation results for learning nondeterministic models. (a) and (b): the maximal and
minimal probabilities of eventually being awarded L coins, $L \in \{0, 1, 2, 5, 10\}$. The size of each
dataset is the same as in Fig. 3. (c): maximal rewards computed by
$\mathcal{R}_{\max}(\Diamond\,\mathit{stop})$ in the learned models and the generating models. (d): the
accuracy of the optimal actions suggested by the learned models.

6 Conclusion

In this paper, we have proposed the IOalergia algorithm for learning deterministic labeled Markov
decision processes (DLMDPs). Given sequences of alternating input and output symbols, the algorithm can
automatically construct a model of the reactive system under observation, and we have a convergence
result for IOalergia similar to the one given in [13] for deterministic Markov chain models. The
algorithm is empirically analyzed using a case study based on slot machines. The learning results are
evaluated by comparing the learned models with the known generating models in terms of PLTL properties
and maximal expected rewards, as well as by the accuracy of the optimal actions derived from the
learned models.

Compared to the learning algorithm for deterministic automata [14], further research is required
to make the learning algorithm suitable for routine use. In addition to empirically demonstrating that
the learned model is a good approximation, measuring the distance between the learned model and the
generating model will be part of our future work. For compositional systems, this learning approach
could be extended to learn models for each individual component from the observed interaction among
components. Moreover, the approach for learning DLMDPs could be refined by active learning techniques
that take advantage of interactive data acquisition.